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Abstract 

We consider a model for adaptive evolution of a phenotype within a fam- 
ily of related species with unknown phylogeny. The unknown species tree is 
modelled by a Yule process conditioned on having n tips. The trait value is 
assumed to evolve along a lineage as an Ornstein-Uhlenbeck process char- 
acterized by the adaptation rate a > 0, the optimal trait value 6 and the noise 
size a > 0. For the vector of n trait values describing the outcome of such an 
evolution we study the moments of the sample mean and sample variance. 
Our analytical and simulation results lead to a simple confidence interval for 
6 when a is larger than half the speciation rate. 

Keywords : Phylogenetic comparative methods, Conditioned Yule branching 
process, Branching Ornstein-Uhlenbeck process, Phylogenetic confidence inter- 
vals, Phylogenetic tree metric 

1 Introduction 

Phylogenetic comparative methods are by now well established and provide im- 
portant tools for studying interspecies data. In short they provide a methodology 
to take into account the evolutionary relationships between observed species when 
studying their traits and potential relationships with the environment the species 
live in. Assuming a stochastic process under which the phenotype evolves on 
the tree, it is desirable to estimate the process' parameters and appropriately in- 
terpret them. As with any statistical procedure it is important to place confidence 
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intervals around the point estimates for the parameters of interest. However, confi- 
dence intervals are often not mentioned in the phylogenetic comparative methods 
literature as pointed out by Boettiger et al. (2012 ). 

There are a number of possible ways of calculating these confidence inter- 
vals and very often they may take intolerable amounts of time depending on the 
assumed stochastic model of evolution. If the tree is known and the estimation 
procedure can be put in a regression framework, then standard regression type 
confidence intervals may be used (e.g. |Hansen[ |1997| [Martins and Hansen] |1997 



Garland and Ives[ |2000[ |Rohlfj |2001| ). In more complicated situations a paramet- 



ric bootstrap is a way out ( |Boettiger et al.[ |2012[ ), used by e.g. |Ives et al] ( |2007| ) 
and implemented in the ouch package ( Butler and King[[2004[ ). Another approach 
is to consider the curvature of the likelihood surface (done by e.g. Bartoszek et al.[ 



in revision] ) in situations where a parametric bootstrap is too time consuming. In 



parametric settings with only few parameters involved, reporting a support region 
(Hansen, 1997 Hansen et al.[ 2008) is a significantly more attractive option. 

All of the above methods have in common that they assume that the phy- 
logeny describing the evolutionary relationships is fully resolved. Possible errors 
in the topology can cause problems - the closer to the tips they occur, the more 
problematic they can be ( |Symonds]|2002| ). Practically, in view of the wealth of se- 
quence information and the advancement in phylogeny reconstruction techniques, 
the assumption of a completely resolved species tree is justifiable. Also from a 
statistical point of view this might not be at first glance such a large problem as 
regression estimators will remain unbiased even with a misspecified tree (Rohlf , 
2006 ), and also seem to be robust with respect to errors in phylogeny (at least for 
the Brownian motion model of phenotype evolution, as noticed by |Stone , 201 1 ). 

A direction of study which seems to be underrepresented currently is connect- 
ing phylogenetic comparative methods, in particular their underlying stochastic 
models of evolution, with stochastic models of tree growth. There are of course 
exceptions to this, e.g., already in the 1970s, a joint maximum likelihood esti- 
mation procedure of a Yule tree and Brownian motion on top of it was proposed 
by Edwards ( 1970), and an MCMC procedure to jointly estimate the phylogeny 
and parameters of the Brownian model of trait evolution has been proposed by 
Huelsenbec k et al.| (|2000); Huelsen beck and Rannala| ( |2003[ ). 



Assuming a stochastic model for the tree's growth allows one to study the 
properties of a sample of species without the need to use a fixed phylogeny. From 
the point of view of a comparative analysis this can be useful in a number of situa- 
tions: if the phylogenetic tree is known to be poorly resolved, or if one wants to see 
how robust the results are with respect to uncertainty in the phylogeny. The case 



2 



of unresolved trees can appear e.g. when studying fossil data. There may be avail- 
able rich phenotypic information but the molecular material might have degraded 
so much that it is impossible to infer evolutionary relationships. "Integrating" over 
the phylogeny can be used as a sanity check, whether the conclusions based on the 
inferred phylogeny deviate much from those from a "typical" phylogeny. Results 
presented here can also be used as a method of testing software for phylogenetic 
comparative models. 

In this work we propose a simple phylogenetic confidence interval formula 
which takes into account the phylogenetic uncertainty. To this end we study prop- 
erties of the sample mean and sample variance 



X x + ...+X n 
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for the vector of trait values (X\, ... ,X n ) in the framework of what we call the 
Yule-Ornstein-Uhlenbeck (YOU) model. The YOU-model is characterized by 
four parameters (Xq, a, a, 6) and consists of two ingredients 

1. the species tree connecting n extant species is modeled by the pure birth 



Yule process (Yule 1924) with a unit speciation rate A = 1 and conditioned 
on having n tips, 

2. the observed values (X\ , . . . ,X n ) evolved from the ancestral state Xq follow- 
ing the Ornstein-Uhlenbeck (OU) process with parameters (a, a, 0). 



The conditioned Yule tree (Gernhard, 2008a) has a convenient backward de- 
scription in terms of n inter-speciation times (T ni T n -\ . . . ,T\) which are indepen- 
dent exponential random variables with parameters (n, n — 1 , . . . , 1). Here is the 
time during which the tree has k branches, so that 



?i + ...+r n 



(1) 



is the total height of the tree and the ancestral trait value Xq is attributed to the 
corresponding time which we call time to the origin, see Fig. [T] For more details 
on treating such combinations of exponential random variables the reader is re- 
ferred to |Feller| ( |1966p , Yule trees are discussed by e.g Ath reya and Ney| ([2000) 
and a consideration of applications of branching processes in biology is presented 
by e.g. |Haccou et al.| ( |2007l ). 

The OU-process X(t) described by the stochastic differential equation 



dX(t) = -a(X(t)-8)dt + odB(t), X(0)=X , 



(2) 
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Figure 1: On the left: a branching OU-process simulated on a random tree with 
n = 5 tips using the TreeSim ( |Stadler[ |25g9[ 12011) and mvSLOUCH ( |Bartoszek 
et al.[ |in revision]) software. Parameters used are a = 1, o = 1, Xq — 6 — 2, after 



the tree height T was scaled to 1. On the right: the species tree disregarding the 
trait values supplied with the notation for the inter-speciation times. For the pair 
of tips (2,3) the time x to their most recent common ancestor is marked on the 
time axis (starting at present and going back to the time of origin). 



can be interpreted (Han sen] |1997] |Butler and King] |2004] |Hansen and Orzack 

Labra et aL| 12009] ) as the evolution of a trait value 



2005, Hansen et all 2008 



along a lineage of species in terms of the adaptation rate a > 0, the optimal trait 
value 6, and the noise size o > 0. The distribution of X(t) is normal with 



E[X(t)] = e+e- at (X -e), Var[X(t)} = o z (l-e- 2at )/2a 



(3) 



implying that X(t) looses the effect of the ancestral state Xq at an exponential 
rate. In the long run the OU-process acquires a stationary normal distribution 
with mean and variance o 2 /2a. Notice that with a = there is no adaptation 
and the trait evolution is governed by the Brownian motion ( Felsenstem] 1985] ). 

Using these properties of the YOU-model we find explicit expressions for 
E [X n ] , Var [X n ] , and E [S%] and propose a simple formula for an asymptotic phylo- 
genetic confidence interval for the optimal value 6 as presented in Section [2] and 
supported by simulations, see Section[3] Our confidence interval formula Eq. ( fTTj ) 
is meant to incorporate not only the randomness in the trait evolution but also the 
uncertainty in the species tree. Even if we have a resolved tree using such an 
interval estimate allows one to see how much worse would one do, if there was 
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any potential error in the tree. Based on this one can decide whether a detailed 
recheck of phenotypic conclusions with respect to variations in the phylogeny is 
necessary or not. 

In Sections [4] and [5] we give the proofs of our main results summarized in 
Section |2j Section [5] contains new formulae for the Laplace transforms for the 
important characteristics of the conditioned Yule species tree: the time to origin 
T and the time x to the most recent common ancestor for a pair of two species 
chosen at random out of n extant species. 

Section[6]gives exact formulae for the first two moments for T and x which are 
applied to compute the mean of the squared splitted nodal distance between two 
independent realizations of the conditioned Yule model. Finally, in Section [7] we 
give recursive formulae for all moments of T and x as well as all joint moments 
ofT — x and x. 

2 Summary of main results 

In the framework of the YOU-model with parameters (Xq,(X,6,o) the random 
variables X n and E [S 12 ] have two sources of uncertainty: unknown species tree and 
stochastic trait evolution along the branches of the tree. In this section we present 
exact and approximate expressions for E \X n ], Var [X n ], and E [S„l . It is important 
to recognize that our standing assumption A = 1 of having one speciation event 
per unit of time causes no loss of generality. To incorporate an arbitrary speciation 
rate X one has to replace in our formulae parameters a and a 2 by & = aj X and 
a 2 = <7 2 / X . This transformation, in view of Eq. ([2]), is equivalent to a time scaling 
by factor A: if t = i/X, then adt = adt and odB(t) = ddB(i). This transformation 
changes neither the optimal value 6 nor the stationary variance o 2 / (2a). 
Under the YOU-model we show that 

E[X n }=b n! aX + (l-b n!a )e, 

where 

b - 1 2 " 
l+x 2+x n+x 

We see that the sample mean X n is an unbiased estimator of 

• the ancestral state Xq in the Brownian motion case, a 

• a weighted mean of the ancestral trait value and the optimal value when 
< a < °° which asymptotically, as the sample size increases, becomes the 
optimal value. 



(4) 
(5) 

= 0, since b„ o = 1 , 



5 



Note that even for moderate values of a the weighted mean Eq. ([4]) is close to 6. 
For example, with a = 1 we get E [X n ] = ^t[Xq + ^jO. 
Secondly, we find that 

Var [X„] = — \ 2a -l)n ~~ ( ° ~ Q >^ 2a ~ (6) 

In the singular case a = 1/2 Eq. (|6]) reads 

2 ^ 

Var %] = — (2a n - -fL) + (X - e) 2 (& n ,i -fc 2 5 ), 

where a ra = l + l/2+... + l/nis the n-th harmonic number. The last expression 
is obtained from Eq. ([6]) by applying the asymptotic formula 

l-(n+l)b n>x 

-> a n+ x - 1, * 1. 

x— 1 

Notice that as a — ► 0, using the fact that 

x~ (l-b niX ) ->a„, jc-)-0, 
we recover from Eq. (|6]) the formula for the Brownian motion model 

Var[X„]=a 2 (2-^), (7) 



obtained earlier by |Sagitov and Bartoszek| ( |20l2| ). 
Letting n — > °° in Eq. ([6]) and using 

K iX _ r(n+l) 



~ n x 



r(x+i) r(n+jc+i; 

we get a wide range of asymptotic regimes: 

2, if a = 0, 

v rvM . . Q. 5 -n- 2 «, if0<a<0.5, 
Var[X„]~^<( 2n l llnn) if a = .5, (8) 

2^TV if a > 0.5, 

where 8 = ^ X ° a 6 ^ is the normalized distance between the ancestral and optimal 
values and 

C Xj5 = Y^^r(2jc+l) + 2a5 2 r(2jc+l)-2a6 2 r 2 (jc+l). 



Thirdly, we compute 



E ^"2^ 1 + (2«-l)(n-l) )' (9) 



which in the case a = 0.5 turns into 



Letting a — >■ in Eq. (M) we arrive at the formula obtained by Sagitov and Bar 



toszek| ( |2012) for the Brownian motion model 



r ?i 9 f n + 1 2n \ 

E[S 2 n }=o 2 ( -a a . (10) 

L J \n — 1 n—l J 

Observe that for any fixed a > we have, 

E[X M ] = e + 0(8n- a ), 

making Z M and 5^ asymptotically unbiased estimators of the optimal state and 
the stationary variance of the OU process respectively. This is in accordance with 



the finding of Hansen (1997) where in a regression setup the residual sum of 
squares was used to estimate the stationary variance. Additionally in the case of 
sufficiently strong adaptation when a > 0.5 letting n — > °°, we have 

2a + 1 o 2 
VarW ~ 2a(2a-l) -T- 

Taking the last relation into account, when a > 0.5, we propose the following 
convenient approximate formula for a 95% confidence interval for the optimal 
trait value : 



K 2,oc + 1 

£,±1.96-S„~, K a = J- -, a>0.5. (11) 

Jn V 2a — 1 



The confidence interval from Eq. ( [TTj ) differs from the classical confidence inter- 
val for the mean (X n ± 1.965„/ y/n) just by a factor K a . The latter is larger than 1, 
as it should, in view of a positive correlation among the sample observations. The 
correction factor K a becomes negligible in the case of a very strong adaptation, 
a»l, when the dependence due to common ancestry can be neglected. 
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Figure 2: Simulation results supporting the confidence interval formula Eq. ( fTT| ). 
Simulations are performed for the YOU-model with X = \,a = 1, a =\,Xq = 1, 
9=0. Left: each dot gives the proportion of the 10000 confidence intervals 
covering 9 for a particular value of n. The horizontal line is the desired 95% 
coverage level. Right: the simulation results for Var [S 2 ,] decrease toward zero 
with n suggesting that S„ is a consistent estimator of the stationary variance. 



i_ LL LL LL 

U, LL X X ILL ILL 

Figure 3: Top row: histograms of X n for the Ornstein-Uhlenbeck model for 
different value of n (left to right: 10, 40, 70, 1 10, 140, 170), for each 10000 inde- 
pendent values of X n were simulated. In the simulations oc = 1,Xq = 1,0 = and 
a = 1. For all histograms the x and y axes have the same scale x E (—1.5, 1.5), 
y G (0, 4) . Bottom row: histograms of X n for the Brownian motion model with the 
same values of n as on the top row. In the simulations Xq = and o 2 = 1, the axes 
values here are in the range x E (— 10, 10) and y E (0, 0.35) . 
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3 Phylogenetic confidence intervals 



Our confidence interval formula Eq. ([TTj) is well supported by simulation results 



presented in Fig. |2j A strict mathematical justification of this formula hinges 
on two results which need to be addressed in a future study as they depend on 
properties of conditioned Yule trees that require further work. One has to prove 
that 

Var[5^]^0, n °o (12) 
and the following central limit theorem has to be verified, 

Conjecture 3.1 As n — > °° the standardized sample mean 

Z n = (X n -K[X n ])(Nv[X n ])- l l 2 (13) 
is asymptotically normally distributed with parameters (0,1). 



We performed extensive simulations finding support for Eq. ( |T2| ), see the right 
panel in Fig. [2j and our conjecture of asymptotic normality for Z n . The bell- 
shaped histograms in the first row of Fig. [3] illustrate the decrease of the variance 
predicted by the theory, see the left panel in Fig. |4j As shown on the central 
panel in Fig. |4} the Shapiro-Wilk test does not indicate significant deviations 
from normality, and moreover, the estimate of the excess kurtosis (not shown) 
oscillates around for large n. 



A more thorough check of normality for Eq. ( [13] ) is given by a hybrid Monte- 
Carlo-numerical approach described below in Algorithm [T] The key idea behind 
Algorithm [I] is that conditional on the species tree X n is normally distributed. 



Our R (R Development Core Team, 2010) implementation of this algorithm is 



available on request. The right panel in Fig. ^demonstrates that with increasing n 
the 0.975-quantile of the distribution of Z n approaches the 0.975-quantile of the 
standard normal distribution. 

Another example of a phylogenetic confidence interval can be suggested in 
the case of the Brownian motion model of evolution where the sample average is 
an unbiased estimator of the ancestral state Xq. Using Eq. |7]) and assuming a is 
known one can think of the following approximate 95% confidence interval for Xq 

o , 

X n ±q n — A/2n-a„, (14) 

V n 

whose symmetric form is justified by histograms in the second row of Fig. [3] 
From Fig. [5] we can see that the appropriate value of q„ should lie somewhere 
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Algorithm 1 Find approximate 1 — a(= 0.975 in our case) level quantile of X n 
1: Simulate N(= 10000 in our case) trees with n tips and specified birth-death 

rates using TreeSim ( |Stadler[|2009||ZPTl) . 
2: Conditioned on the tree i calculate the mean {jit) and variance (of) of X n 

under the desired stochastic process of phenotype evolution. 
3: Calculate the unconditional E [X n ] and Var [X n ] under the assumed stochastic 

models of tree growth and phenotype evolution. 
4: Use the R function optim() to minimize, 



1 N ( I _ 

( 1 " °0 - jj £ * [ E i*n] + 4 V Var &>] > ^' 



with respect to g, where ^>(-,jU, a 2 ) is the cumulative distribution function for 
a normal distribution with mean ji and variance o 2 . 



between the normal and t distribution quantiles. Obviously the width of the con- 
fidence interval of Eq. ( [T4| ) does not converge to zero as n — > °° (also observed 
by |Ane[ 2008 [ but with a different model of tree growth). The kurtosis estimates 



and Shapiro-Wilk test p-values indicate that the distribution of X n deviates from 
normality and this should be remembered when applying them to data, see Fig. 
|5J On the other hand, the histograms in Fig. [4] do not indicate that a practitioner 
should be overly worried about this. 



4 Derivation of key formulae 

In this section we derive Eqs. (|6]) and (|9]). Let X and Y be a pair of trait values 
randomly chosen (without replacement) from n available values (X\ 7 . ..,X n ). De- 
note by T the time until the most recent common ancestor for the pair of species 
behind the sampled trait values X and Y. Using conditioning on the time of origin 
T and Eq. (|3]) we obtain 

E [X] = E [E [X\T]} =E[6 + e- aT (X - 6)} = Q + (X Q - G) E [e^ 7 ] , (15) 
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Figure 4: Simulation results concerning X n for the YOU-model with a = 1, 
Xq = I, 6=0, o 2 = 1. Left: variance of the sample mean, simulated and theo- 
retical values as functions of n. Center: p-value of Shapiro-Wilk test for normal- 
ity ( Royston[ 1995[ indicates 0.1 as an adequate cut-off level). Right: 0.975 level 
quantile calculated according to Algorithm [T] (the inset in the graph is a zoom-in 
on the y-axis around the points). Each point is estimated from 10000 Yule trees, 
except for the Shapiro-Wilk test, where a subsample of 5000 trees was used due 
to the requirements of the shapiro.testQ R function. 



and 



Var [X] = E [Var [X | T] ] + Var [E [X | T] 

+ Var[(X -9)e- aT + 9] 



°T^-e- 2aT ) 



2a 



■\-E[e- 2aT ]) + (X -6) 2 Yar[e- aT ] . 



Since obviously E [X n ] = E [X], Eq. p5] ) yields 

E[X n ] = e + (X o -0)E[e- aT ]. 



(16) 



(17) 



Next, conditioning on T and the trait value X T of the most recent common 
ancestor of the randomly chosen pair of extant species, we get by applying Eq. 

Cov [X,Y] = E [Cov [X , Y | T, X r }\ + Cov [E [X | t, X z ] , E [Y | t, X t ] ] 
= Yar[(X t -e)e- aT + e] . 
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Figure 5: Characteristics of the distribution of X n for the Brownian motion model 
with Xq = and o 2 = 1. Top left: variance of X n , top centre: excess kurtosis 
estimate for standardized X n , top right: p-value of Shapiro-Wilk test for normal- 
ity, bottom left: 0.975-quantiles of standardized X n , bottom right: 0.975-quantile 
calculated according to Algorithm [T] Each point is estimated from 10000 inde- 
pendently simulated Yule trees, except for the Shapiro-Wilk test based on 5000 
trees. 
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Conditioning on (T, t) gives 

Var [(X T - B)e- at ] = E [ e - 2aT Var [X T |7\ t]] + Var [e~ aT E [X T - z}] 



-lax 



-2a(T-x) 



2a 



+ Var 



e- a \X o -0)e 



-a(T-T) 



Therefore it follows that 



Cov [X,Y] = y— (E [e~ 2aT ] - E [g- 2ar ] ) + (X - 6) 2 Var [ e - ar ] . 



(18) 



It was shown by Sagitov and Bartoszek (2012) that the variance of the sample 
average and the expectation of the sample variance can be compactly expressed 



Var[X„] 

.-21 



VarfXl 



n-1 



l-p n )Var[Z] 



(19) 
(20) 



E[SI] =(l-p n )Var[Z] 

in terms of the interspecies correlation coefficient p„ = Cov[X,7] /Var[X], first 
studied by Sagitov and Bartoszek (2012) for the Brownian motion model. In view 
of Eqs. ( fT6] ) and ( [T8] ), for the YOU-model the interspecies correlation takes the 
form 



a 

la 



[l-E[e- 2at ]) 



Pn 1 ^(l-E[e- 2aT }) + (X -e) 2 V a r[e-<* T ] 
Putting together Eqs. ( [To] ), ( fT9| ), ( [20] ), and (|2TJ) we arrive at 
~2 



(21) 



Var [Z n ] = ^- (1 - nE [e- 2ar ] + (n - 1)E [e" 2 ^] ) + (Z - 0) 2 Var [e^] , 

E[S3^(1-E[^]). 

Eqs. ([4]), ([6]), ([9]) readily follow from Eq. ( fTVj ) and the last two relations in view of 

E[ e -* r ] =b„s, (22) 

and 

2-( B+1 ) C v +1 )^ 
(n-l)(y-l) 

both derived in the next section. Note that Eq. ([23]) with y = 1 reads as 



E[e-] 



n-1 



1 

n+l 
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Figure 6: Simulation results in the case X = 1, a 2 = 1, Xq = 0, = 1, and 
n = 30. Regression lines fitted to simulated data (thick line) are indistinguishable 
from the true regression line y = P30X+ (1 — &3o,ct)(l — P30), where p^o is given 
by Eqs. p4| ) and ( |2~5] ) for different values of oc. From the left to the right: a = 
0.05, 0.5, 1, 5. 



and that Eq. (23 ) entails 



Var[e-« r ] =b n>2a -b 2 n>a . 

We conclude this section by recording the exact formula of the interspecies 
correlation coefficient for the YOU-model in the case of a / 0.5 



Pn = 1 



2a(n - 1) + (n + l)((2a + l)Z? Mi2 a - r 



(n- l)(2a - 1)(1 + (2a8 2 - \)b, h2a - 2a5 2 ^ ( 
and in the case of a = 0.5, 

n+1 n + 2 — 2a n +i 



(24) 



Pn = 1 - 



n — 



ln + 5 2 (l-(n+l)^ 05 ; 



(25) 



IX —6\ 

where 5 = 10 1 . This formula, illustrated by Fig. 4 should be compared to 
similar formulae due to Sag itov and Bartoszek| ( |2012 ) obtained for a = when 
the unknown species tree is modelled by a generalization of the conditioned Yule 
process, a conditioned birth-death process allowing for extinction of species. 



5 Laplace transform for (7, t) 

The conditioned birth-death processes as stochastic models for species trees, have 
received significant attention in the last decade (e.g. Aldous and PopovTc] 2005 
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Gernhardj |2008al |Stadlerj [20091 |20~TT1 |Mooers et all |20T2t |Stadler and Steel) 



2012[). Here we contribute to these studies by computing the joint Laplace trans- 



form of, T, the height of a Yule tree conditioned on n tips at present and x, the 
height of the most recent common ancestor for two randomly sampled tips (for 
results concerning the distribution of x and T see also Gernhardj 2008a|b[ Stadler[ 
2009). Figure [T] visualizes these random variables. This also allows us to give 



recursive formulae for the joint moments of T and x. 

Recall that we consider the Yule process with a unit speciation rate A = 1. It is 
easy to extend the result below to the case of an arbitrary speciation rate A: simply 
replace the times T and x by T = TX and t = xX. 

Lemma 5.1 Let K n (1 < K n < n — 1 ) be the number of the splits in the tree ( counted 
forward in time) corresponding to the coalescent of two randomly sampled tips. 
Then 

2(n+l) 



P{Kn ~ k) ~ (n-l)(k + 2)(k+V 
It follows that for any positive x and y we have 



l,...,n-l. 



-x(T-n)-yt 



2(n+l)b n , y \} 



(n-1) 



k=l 



h, x 

(k + 2)(k+l)b k . 



(26) 



(27) 



implying Eqs. (22 ) and (23 ). 



PROOF Tracing the lineages of two randomly sampled tips of the Yule tree towards 
the root, the coalescent can be viewed as the success in a sequence of independent 
Bernoulli trials. This argument leads to the expression 



PfK. 



1 



1 



1 



1 



yielding Eq. ( |26| ) after expanding the binomial coefficients. Eq. (|27J) follows 
easily from Eq. ([5]) and 



-x{T-x)-yz 



n-1 



i 



k+i 



k=\ 



x+\ x+ky+k+\ 



y + n 



which is a consequence of the representation, in Eq. ([T]), 



Ti + ... + T, 



K„, 



Tk„+i + ... + T n . 



15 



Putting y = x in Eq. ( |27] ) we obtain Eq. ( |2"2] ). Plugging jc = in Eq. ( f27| ) we 

get 

L J (^ 2 ) "' y+« 

2(n+l)! ^T^+l+j) 



( n _i)r (};+n+ i)^ i r(fc+3) • 

When j = 1 this directly becomes, 

E[e~ T ] =^—(a n -l)- 1 



m — 1 n + 1 

and for y ^ 1 we use the following relation (easily verified by induction when 

z^y) 

n y X r(k + y) = r(i» + z)r(y+l)-r(z+l)r(n + y) 
^r^ + z+l) r{z+l)T(n + z )(z-y) 

to derive Eq. ( |23] ) 



E[e~y*\ 



2I> + l+y)-I> + 2)r(y + 2) _ 2- (w+ l)(y + l)^y 
(n-l)r(y+n+l)(y-l) (n-l)(y-l) ' 

□ 



6 First two moments of T and T 

In Section [7] using Eq. ( [27] ) we present a recursive procedure for obtaining all the 
joint moments of T and x. In particular, the first two moments are 

E[T]=a n , E[T 2 ]=a 2 n +H na , 

n+1 In r9l n + 1 , o N 4(a n f« + l) — n) 

Et = f a„ r, E T 2 = fl 2 + ff J L 

n—\ n — 1 L J n— 1 ' n— 1 

E[rr] = — (fl2-^„. 2 )-2(fl„-l), 
n 

v , n?i ,n+ 1 1 . 8a„ 2 . 9 

e [ ( r - ,f\ = 4(— «, 2 + _) - - jl + (a ; + „„, 2) , 
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Figure 7: Simulated and true values of E[T] top left, Var[T] top center, E[t] 
top right, Var [t] bottom left, Cov [T, t] bottom centre, log(E \d(Tree' ', Tree") 2 ] ) 
bottom right. Each point comes from 10000 simulated Yule trees. 



where H n 2 = E \ is the generalized harmonic number of order 2. The results of 
this section are illustrated on Fig. [7] Applying directly verifiable identities 

"y a-k _ n-a„ ^ 
h\ (k+l)(k + 2) ~ n+1 ^ ' 

v 1 4 _ rj 1 1 K + i) 2 . ^ 2 1 1 

i fe(Jfc+l)(* + 2)" J ""' 2 + 1 n + 1 ^ 6 +i ' 

we get the following asymptotic behaviour as n — > °° 

E[r]~lnn, E[r]~lnn, E[T-t]->2, 

E[r 2 ]~ln 2 n, E[T 2 ]~ln 2 n, E[rr]~ln 2 n, E [(T - r) 2 ] -> 
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Using the first two moments we get, 
War[T]=H niZ , 

Var [t] = . — ((n 2 - \)H, h2 - 2(n + l)a 2 n + 4(n + l)a„ - 4n) , (29) 
(n — i)^ 

Cov [r, t] = -^r(2n - (n+ 1)^,2), 
n — 1 

so that as n — > oo 

2 2 2 

Var[r]->^-, Var[r]^^-, Cov [T, t] -> 2 - ^, 

which gives the limiting relationship Corr [r, t] — >■ ^| — 1 ~ 0.216. 

As an illustration, we apply the obtained results to derive the expectation of a 
splitted nodal distance between two independent realizations (Tree', Tree") of the 
conditioned Yule model tree, 

2n 

E[d{Tree',Tree") 2 ] =- — - ((n 2 - \)H n>2 - 2(n + 1)(< - 2a n ) - An) . (30) 
Given two trees, each with n labelled tips, the function 
d(Tree',Tree") = a \l £ (%[ 



i<i<j<n 



'J 



T 



■"\2 



(31) 



where T,j is the height of the most recent common ancestor of tips i and j in the 



netic trees (Cardona et al.[ 2010). Eq. pi) is the weighted-tree counterpart of the 


path-difference metric (see 


Steel and Penny, 1993; Cardona et al. 


2010, 


Mir and 


Rossello 2010; Mir et al.[ 2012 


) 



d(Tree ', Tree 



\<i< j<n 



where lL, l"j are the number of speciation events on the path connecting tip nodes 
i and j in the appropriate tree. 

To prove Eq. < |30] ) it is enough to notice that according to Eq. pTj ) we can 
write 

E [d{ Tree', Tree") 2 ] = 2 Q E [(t' - x") 2 ] , 
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where %' and x" are two independent versions of the random variable X. Since 

E[(x'-x") 2 ] =2Var[r], 



Eq. OOb follows from Eq. (29). Note also that Eq. (BOb yields as n -)■ oo 



7T 2 2 



E[rf(7W,7Yee") 2 ] ~ —n 



7 All moments of T and r 



Eq. ( |22 ) for the Laplace transforms of the random variable T can be used to 



calculate the moments of T using 

E[T m ] = (-\) m (d m E [e~ xT ] /dx m )U=o- (32) 
For a fixed n we introduce the following notation 

A(x) = ^— — , 
v ; x+1 x + n 

r f N 1 1 

M*) = 7 , -,w + • ■ • + 



m ' 



(X+I) ra (jC + /l) 

b w (x) = (&i(x),...,Z? m (x)). 

Notice that A (0) = 1/n! and £> ra (0) = //„, m is the n-th generalized harmonic num- 
ber of order m 

n y 

H n ,m = £ (33) 
(=1 



We can write Eq. < |22] ) as E [e~ xr ] = m!A(jc). Its first derivative with respect 
to x is — n\A(x)b\{x), and the second derivative is n\A(x)(b\(x) 2 -\-biix)). For the 
general recursive formula we introduce the following notation. We will denote 
by k = (fci,&2, ■ • ■) infinite dimensional vectors with integer-valued components, 
and write k 6 $rf m if all ki > and |k| := Y!iL\ hi = m. Therefore stf m represents 
the set of all possible ways to represent m as a sum of positive integers. We will 
also use the multi-index notation b m (x) k = b\ (x) kl ■ b m (x) m . 

Since A'(x) = —A(x)b\{x), and b' m (x) = —mb m+ i(x), we can show by induc- 
tion that, 

r) m 

^—E [e- xT ] = (~l) m n\A(x) £ c k b ra (*) k , (34) 
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where coefficients Ck are defined for all vectors k = (k\,k2,---) with integer- 
valued components using the recursion 

in 

c k = + l)ckj, (35) 

with m = |k| and 

Cfc,0 = C(k 1 -l,k 2 ,k 3 ,...), 
CkJ=C( felr .. ) ^ + l ) fe j . +1 _l ) ...), j> 1. 



The boundary conditions for the recursion of Eq. ( [35] ) consist of two parts: 

• Ck = 0, if all kj = 0, or one of the coordinates of the vector k is negative, 

• Ck = 1 if k\ > 1 and all other kj = 0. 



We conclude from Eq. (34 ) that 



ke^„ i=l 

The technique for calculating the m-th derivative of the Laplace transform of 
x given by Eq. ( f23| ) is the same but requires new notation 

1 1 1 

My) 



v-1 y + 2 y + n' 
bmiy) = 7 rr~ + 7 — + . . . + 



(y + 2) m (v + n) m 

Notice that A' (y) = -A(y)bi(y), V m {y) = -mb m +\{y), A(0) = -n! and S m (0) = //„,, 
if m is even or b m (0) = H n . m — 2 if m is odd. One can then inductively show that, 

d ' n f r,-*i - ( ~ 1)m2m! (-i) w+1 (^+i)' AMr , , vY » 

+ £ c k b m (v) k ), 
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with the coefficients Ck defined as previously by Eq. ([35]). Therefore, we get 



in 



o | in 

e k] =—-{a n - ir + 1 c k n - it* n 

K€zSX m 1=1 1=1 

k\ <m i °dd i even 



Similarly we can use Eq. ( [27] ) to calculate the joint moments for T — x and x 
in terms of 

A^\x ' 1 



x+i+l x + j 



For m > 1 and r > 1 we first get 



r 



E 



c -x(r-T)-yT 



1 



\in+r 



2(n + l)! 



and then from the above 



E[(r-T) w T r ] = (-i 



,m+r ■* 



71-1 

n— 1 j 



X 



e (/+1 ; (/+2) e <- k n«':, e ^ii"- ».• 
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